Proximal Policy Optimization 2 (PPO2) — from scratch in PyTorch#

This notebook builds a low-level PPO2 implementation in PyTorch and uses it to train an agent on a classic control environment.


Learning goals#

By the end you should be able to:

  • derive the PPO2 clipped objective and connect it to a trust-region intuition

  • implement PPO2 (rollout → GAE → multi-epoch mini-batch updates) in raw PyTorch

  • understand exactly how PPO2 differs from PPO1 (both in the paper and in Stable-Baselines naming)

  • plot episodic rewards and training diagnostics with Plotly


Prerequisites#

  • comfortable with gradients and backprop

  • basic RL notation: policy \(\pi_\theta(a\mid s)\), returns, value function \(V_\phi(s)\)

  • packages: torch, gymnasium (or gym), numpy, plotly

import math
import os
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, Independent, Normal

# Gymnasium first (new API), fallback to Gym (old API)
try:
    import gymnasium as gym
except Exception:  # pragma: no cover
    import gym  # type: ignore

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

SEED = 42
rng = np.random.default_rng(SEED)

torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
device(type='cpu')

1) The RL objective (notation)#

We’ll use the standard episodic discounted-return objective:

\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T-1} \gamma^t r_t\Big] \]
  • \(\tau = (s_0, a_0, r_0, s_1, \dots)\) is a trajectory sampled by following the policy.

  • \(\gamma \in (0, 1]\) is the discount factor.

Two key helper objects:

  • Value function: \(V_\phi(s) \approx \mathbb{E}[\sum_{k\ge 0} \gamma^k r_{t+k} \mid s_t=s]\)

  • Advantage: \(A_t = Q(s_t, a_t) - V(s_t)\) — “how much better was this action than average?”

2) Policy gradients in one equation#

The policy-gradient theorem motivates the surrogate objective:

\[ \nabla_\theta J(\theta) = \mathbb{E}_t\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\big] \]

In practice we:

  1. sample data with an old policy \(\pi_{\theta_{\text{old}}}\)

  2. estimate advantages \(\hat{A}_t\) (often via GAE)

  3. update the policy using mini-batch SGD.
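As a minimal sketch of step 3 (toy logits and made-up advantages, not tied to any environment), the surrogate gradient can be computed like this:

```python
import torch

# Toy example: 3 sampled (state, action) pairs for a 2-action policy.
logits = torch.tensor([[1.0, -1.0], [0.5, 0.5], [-2.0, 2.0]], requires_grad=True)
actions = torch.tensor([0, 1, 1])
advantages = torch.tensor([1.0, -0.5, 2.0])

log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(3), actions]
# Maximize E[log pi(a|s) * A]  <=>  minimize the negative surrogate.
surrogate = -(log_probs * advantages).mean()
surrogate.backward()
assert logits.grad is not None and logits.grad.shape == (3, 2)
```

Calling `.backward()` on the negative surrogate is exactly what an optimizer step will use later; the sign flip is just because PyTorch optimizers minimize.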

3) Why PPO exists: “big steps” break policy gradients#

A vanilla policy-gradient update can change the policy too much in a single step: the gradient estimate is only valid locally, and one oversized step can collapse performance before fresh on-policy data can correct it.

PPO controls this by comparing the new policy to the old policy using the probability ratio:

\[ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \]

If \(r_t(\theta)=1\) the new policy agrees with the old policy on that sampled action.

The classic importance-sampled surrogate (CPI) is:

\[ L^{\text{CPI}}(\theta) = \mathbb{E}_t\big[r_t(\theta)\,\hat{A}_t\big] \]

The problem: maximizing this can push \(r_t\) to extreme values — effectively taking a too-large policy update.
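To make the ratio concrete, here is a toy computation (made-up log-probs) using the numerically stable log-space form \(r_t = \exp(\log\pi_\theta - \log\pi_{\theta_{\text{old}}})\):

```python
import torch

# Toy log-probs for the same sampled actions under old vs. new parameters.
old_log_probs = torch.tensor([-0.69, -1.20, -0.05])
new_log_probs = torch.tensor([-0.51, -1.20, -3.00])

ratio = torch.exp(new_log_probs - old_log_probs)
# ratio > 1: the new policy likes this action more; the middle entry is
# exactly 1 because that action's probability is unchanged; ratio << 1
# means the new policy nearly abandoned the action.
```

The third entry shows how quickly the ratio can collapse toward 0 (or blow up) when the policy moves a lot, which is exactly what the clipped objective guards against.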

4) PPO1 vs PPO2 (be precise about naming)#

People use “PPO1” vs “PPO2” in two different ways:

A) In the PPO paper (algorithmic variants)#

  • PPO-Penalty: adds a KL penalty \(\beta\,\mathrm{KL}(\pi_{\text{old}}\,\|\,\pi_\theta)\) and adapts \(\beta\).

  • PPO-Clip: uses a clipped surrogate objective (no explicit KL penalty term).

A common PPO-Penalty surrogate is:

\[ L^{\text{KLPEN}}(\theta) = \mathbb{E}_t\Big[r_t(\theta)\,\hat{A}_t - \beta\,\mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot\mid s_t)\,\|\,\pi_{\theta}(\cdot\mid s_t)\big)\Big] \]

with \(\beta\) tuned (often adaptively) to keep the KL near a target. PPO-Clip instead bakes the “keep it close” constraint into the objective via clipping.

Many blogs call these “PPO1” (penalty) and “PPO2” (clip). When this notebook says PPO2, it means PPO-Clip.
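For contrast with the clipped objective implemented below, here is a minimal sketch of the PPO-Penalty surrogate for a toy discrete policy. The doubling/halving rule for \(\beta\) is one common adaptive heuristic, not something this notebook trains with:

```python
import torch
from torch.distributions import Categorical, kl_divergence

beta, kl_target = 1.0, 0.01

old_dist = Categorical(logits=torch.tensor([[1.0, -1.0]]))
new_dist = Categorical(logits=torch.tensor([[0.8, -0.8]]))
action = torch.tensor([0])
ratio = torch.exp(new_dist.log_prob(action) - old_dist.log_prob(action))
advantage = torch.tensor([1.0])

# Penalized surrogate: importance-sampled term minus a KL penalty.
kl = kl_divergence(old_dist, new_dist).mean()
l_klpen = (ratio * advantage).mean() - beta * kl

# One common adaptation rule: grow/shrink beta to keep KL near the target.
if kl > 1.5 * kl_target:
    beta *= 2.0
elif kl < kl_target / 1.5:
    beta /= 2.0
```

Here the two policies are close, so the KL is small and \(\beta\) stays put; PPO-Clip achieves the same "stay close" effect without maintaining \(\beta\) at all.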

B) In OpenAI Baselines / Stable-Baselines (implementation families)#

Stable-Baselines historically exposes two codebases:

  • PPO1: an older MPI-oriented implementation (requires mpi4py), with different batching and optimizer plumbing.

  • PPO2: a newer implementation that supports vectorized envs and (optionally) value-function clipping (cliprange_vf).

Important nuance: Stable-Baselines PPO1 also uses the clipped surrogate; the “1 vs 2” there is mostly engineering, not the core objective.

Concretely, in Stable-Baselines:

  • PPO1 is documented as an “MPI version”, with hyperparameters like timesteps_per_actorbatch, optim_stepsize, optim_batchsize, and a learning-rate schedule.

  • PPO2 is documented as a “GPU version”, with hyperparameters like n_steps (per env), nminibatches, noptepochs, and the extra cliprange_vf option for value clipping.

If you’re comparing results across implementations, these differences (batch construction + optimizer details + value clipping) can matter even when the high-level PPO objective looks similar.

5) PPO2 clipped objective (the main idea)#

PPO2 replaces the CPI surrogate with the clipped surrogate:

\[ L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\big)\Big] \]

Interpretation:

  • If \(\hat{A}_t > 0\) (action better than baseline), we don’t want \(r_t\) to grow far above \(1+\epsilon\).

  • If \(\hat{A}_t < 0\) (action worse than baseline), we don’t want \(r_t\) to shrink far below \(1-\epsilon\).

So PPO2 constrains the effective improvement you can get from any single sample.
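A toy tensor computation (made-up ratios, all advantages positive) shows the cap in action:

```python
import torch

eps = 0.2
ratio = torch.tensor([0.5, 1.0, 1.5])
adv = torch.tensor([1.0, 1.0, 1.0])  # all actions "better than baseline"

unclipped = ratio * adv
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
surrogate = torch.min(unclipped, clipped)
# The r=1.5 sample is capped at 1.2: pushing the ratio further gains nothing,
# so its gradient contribution vanishes once clipping kicks in.
```

Note the `min` is one-sided: when the unclipped term is *worse* (e.g. the r=0.5 sample here), it is kept, so clipping never hides a pessimistic signal.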

Full loss (actor + critic + entropy)#

In practice we minimize the negative surrogate plus a value loss and an entropy bonus:

\[ \mathcal{L}(\theta,\phi) = -L^{\text{CLIP}}(\theta) + c_v\,\mathbb{E}_t[(V_\phi(s_t) - \hat{R}_t)^2] - c_e\,\mathbb{E}_t[\mathcal{H}(\pi_\theta(\cdot\mid s_t))] \]

where \(\hat{R}_t\) are “return targets” (often \(\hat{A}_t + V(s_t)\)).

Value function clipping (SB/OpenAI variant)#

Stable-Baselines PPO2 optionally clips value updates (not in the original PPO paper):

\[ V^{\text{clip}}(s_t) = V_{\text{old}}(s_t) + \mathrm{clip}(V(s_t)-V_{\text{old}}(s_t), -\epsilon_v, \epsilon_v) \]

and uses the max of the unclipped/clipped squared error.
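A small NumPy sketch of that value-clipping rule (toy numbers):

```python
import numpy as np

eps_v = 0.2
v_old = np.array([1.0, 1.0])
v_new = np.array([1.5, 0.9])   # candidate critic predictions after an update
returns = np.array([2.0, 2.0])

# Clip the value update to stay within eps_v of the old prediction ...
v_clipped = v_old + np.clip(v_new - v_old, -eps_v, eps_v)
# ... then take the max of unclipped/clipped squared errors.
loss = 0.5 * np.mean(np.maximum((v_new - returns) ** 2, (v_clipped - returns) ** 2))
```

For the first sample the clipped prediction (1.2) incurs the larger error, so the `max` penalizes the big value jump; the second sample moved less than `eps_v`, so clipping changes nothing.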

# Visual intuition: how clipping changes the surrogate
eps = 0.2
ratios = np.linspace(0.0, 2.0, 600)

A_pos = 1.0
A_neg = -1.0

def clipped_surrogate(r, A, eps):
    r_clipped = np.clip(r, 1.0 - eps, 1.0 + eps)
    return np.minimum(r * A, r_clipped * A)

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=(
        'Surrogate term when $A_t > 0$',
        'Surrogate term when $A_t < 0$',
    ),
)

for col, A in [(1, A_pos), (2, A_neg)]:
    fig.add_trace(
        go.Scatter(x=ratios, y=ratios * A, name='CPI: $rA$', line=dict(width=2)),
        row=1,
        col=col,
    )
    fig.add_trace(
        go.Scatter(
            x=ratios,
            y=clipped_surrogate(ratios, A, eps),
            name=r'PPO2: $\min(rA, \mathrm{clip}(r)A)$',
            line=dict(width=3),
        ),
        row=1,
        col=col,
    )
    fig.add_vline(x=1.0 - eps, line=dict(color='gray', dash='dot'), row=1, col=col)
    fig.add_vline(x=1.0 + eps, line=dict(color='gray', dash='dot'), row=1, col=col)

fig.update_layout(
    title='PPO2 clipping limits how much any sample can improve the objective',
    xaxis_title=r'$r_t(\theta)$',
    height=380,
    legend=dict(orientation='h', yanchor='bottom', y=-0.25, xanchor='left', x=0.0),
)
fig.update_xaxes(range=[0.0, 2.0])
fig.show()

6) Advantage estimation: GAE(\(\lambda\))#

A practical choice is Generalized Advantage Estimation:

\[ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]
\[ \hat{A}_t^{\text{GAE}(\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l} \]
  • \(\lambda \to 0\): lower variance, higher bias (closer to one-step TD)

  • \(\lambda \to 1\): lower bias, higher variance (closer to Monte Carlo)

We’ll also use \(\hat{R}_t = \hat{A}_t + V(s_t)\) as the target return for the critic.
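Before wrapping this in a function, here is the backward recursion on a toy 3-step rollout (made-up rewards and values); each advantage folds the next one in with a factor \(\gamma\lambda\):

```python
import numpy as np

gamma, lam = 0.99, 0.95
rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.6, 0.7])
next_value = 0.8              # bootstrap V for the state after the rollout
dones = np.array([0.0, 0.0, 0.0])

adv = np.zeros(3)
last = 0.0
for t in reversed(range(3)):
    next_v = next_value if t == 2 else values[t + 1]
    delta = rewards[t] + gamma * next_v * (1.0 - dones[t]) - values[t]
    last = delta + gamma * lam * (1.0 - dones[t]) * last
    adv[t] = last
returns = adv + values        # critic targets: R_hat = A_hat + V
```

Earlier timesteps accumulate more discounted TD errors, so `adv[0] > adv[1] > adv[2]` here; a `done` flag at step `t` would zero out everything flowing back from beyond it.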

7) Implementation roadmap (what we’ll code)#

PPO2 training loop per update:

  1. Collect a rollout of length \(T\) (here: n_steps) with the current policy.

  2. Compute values \(V(s_t)\), log-probs \(\log\pi(a_t\mid s_t)\), and rewards.

  3. Compute GAE advantages \(\hat{A}_t\) and returns \(\hat{R}_t\).

  4. For n_epochs epochs:

    • shuffle the rollout into mini-batches

    • optimize the clipped policy objective + value loss + entropy bonus.

We’ll log:

  • episodic returns (what you care about)

  • policy loss, value loss, entropy

  • approximate KL and clip fraction (sanity checks)

def env_reset(env, *, seed: Optional[int] = None):
    out = env.reset(seed=seed) if seed is not None else env.reset()
    if isinstance(out, tuple) and len(out) == 2:
        obs, _info = out
        return obs
    return out


def env_step(env, action):
    out = env.step(action)
    # Gymnasium: (obs, reward, terminated, truncated, info)
    if isinstance(out, tuple) and len(out) == 5:
        obs, reward, terminated, truncated, info = out
        done = bool(terminated) or bool(truncated)
        return obs, float(reward), done, info
    # Gym: (obs, reward, done, info)
    obs, reward, done, info = out
    return obs, float(reward), bool(done), info


def set_seed_everywhere(seed: int):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


def explained_variance(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """1 - Var[y_true - y_pred] / Var[y_true]."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    var_y = np.var(y_true)
    if var_y < 1e-12:
        return float('nan')
    return float(1.0 - np.var(y_true - y_pred) / var_y)
class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, action_space, hidden_sizes=(64, 64)):
        super().__init__()
        self.obs_dim = int(obs_dim)
        self.action_space = action_space

        layers: List[nn.Module] = []
        in_dim = self.obs_dim
        for h in hidden_sizes:
            layers.append(nn.Linear(in_dim, h))
            layers.append(nn.Tanh())
            in_dim = h
        self.backbone = nn.Sequential(*layers)

        # Discrete actions: categorical over logits
        if isinstance(action_space, gym.spaces.Discrete):
            self.is_discrete = True
            self.n_actions = int(action_space.n)
            self.actor = nn.Linear(in_dim, self.n_actions)
            self.log_std = None
        # Continuous actions: diagonal Gaussian
        elif isinstance(action_space, gym.spaces.Box):
            self.is_discrete = False
            self.action_dim = int(np.prod(action_space.shape))
            self.actor_mean = nn.Linear(in_dim, self.action_dim)
            self.log_std = nn.Parameter(torch.zeros(self.action_dim))
        else:
            raise TypeError(f'Unsupported action space: {type(action_space)}')

        self.critic = nn.Linear(in_dim, 1)

    def _dist(self, obs: torch.Tensor):
        h = self.backbone(obs)
        if self.is_discrete:
            logits = self.actor(h)
            return Categorical(logits=logits)
        mean = self.actor_mean(h)
        std = torch.exp(self.log_std).expand_as(mean)
        return Independent(Normal(mean, std), 1)

    def value(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.backbone(obs)
        return self.critic(h).squeeze(-1)

    def act(self, obs: torch.Tensor, action: Optional[torch.Tensor] = None):
        dist = self._dist(obs)
        if action is None:
            action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        value = self.value(obs)
        return action, log_prob, entropy, value
@dataclass
class Rollout:
    obs: np.ndarray
    actions: np.ndarray
    log_probs: np.ndarray
    values: np.ndarray
    rewards: np.ndarray
    dones: np.ndarray


def make_rollout_storage(n_steps: int, obs_dim: int, action_space) -> Rollout:
    obs = np.zeros((n_steps, obs_dim), dtype=np.float32)
    rewards = np.zeros((n_steps,), dtype=np.float32)
    dones = np.zeros((n_steps,), dtype=np.float32)
    values = np.zeros((n_steps,), dtype=np.float32)
    log_probs = np.zeros((n_steps,), dtype=np.float32)

    if isinstance(action_space, gym.spaces.Discrete):
        actions = np.zeros((n_steps,), dtype=np.int64)
    elif isinstance(action_space, gym.spaces.Box):
        act_dim = int(np.prod(action_space.shape))
        actions = np.zeros((n_steps, act_dim), dtype=np.float32)
    else:
        raise TypeError(f'Unsupported action space: {type(action_space)}')

    return Rollout(obs=obs, actions=actions, log_probs=log_probs, values=values, rewards=rewards, dones=dones)
def compute_gae(
    rewards: np.ndarray,
    dones: np.ndarray,
    values: np.ndarray,
    next_value: float,
    *,
    gamma: float,
    gae_lambda: float,
) -> Tuple[np.ndarray, np.ndarray]:
    """Returns (advantages, returns)."""
    n_steps = len(rewards)
    advantages = np.zeros((n_steps,), dtype=np.float32)
    last_gae = 0.0
    for t in reversed(range(n_steps)):
        next_nonterminal = 1.0 - dones[t]
        next_v = next_value if t == n_steps - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_v * next_nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns

8) PPO2 update step (PyTorch)#

The heart of PPO2 is computing:

  • the ratio \(r_t(\theta)\) using old and new log-probs

  • the clipped surrogate

  • the value loss (optionally clipped)

  • the entropy bonus

and then doing standard backprop + optimizer step.

def ppo2_update(
    model: ActorCritic,
    optimizer: torch.optim.Optimizer,
    *,
    obs: torch.Tensor,
    actions: torch.Tensor,
    old_log_probs: torch.Tensor,
    old_values: torch.Tensor,
    advantages: torch.Tensor,
    returns: torch.Tensor,
    clip_coef: float,
    vf_clip_coef: Optional[float],
    ent_coef: float,
    vf_coef: float,
    max_grad_norm: float,
) -> Dict[str, float]:
    action, log_prob, entropy, value = model.act(obs, action=actions)

    log_ratio = log_prob - old_log_probs
    ratio = torch.exp(log_ratio)

    # Policy loss (clipped)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    policy_loss = -torch.mean(torch.min(unclipped, clipped))

    # Value loss (optionally clipped, SB/OpenAI variant)
    if vf_clip_coef is None or vf_clip_coef < 0:
        # No value clipping (matches the original PPO paper)
        value_loss = 0.5 * F.mse_loss(value, returns)
    else:
        v_clipped = old_values + torch.clamp(value - old_values, -vf_clip_coef, vf_clip_coef)
        v_loss1 = (value - returns).pow(2)
        v_loss2 = (v_clipped - returns).pow(2)
        value_loss = 0.5 * torch.mean(torch.max(v_loss1, v_loss2))

    entropy_loss = -torch.mean(entropy)

    loss = policy_loss + vf_coef * value_loss + ent_coef * entropy_loss

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()

    approx_kl = torch.mean(-log_ratio).item()
    clipfrac = torch.mean((torch.abs(ratio - 1.0) > clip_coef).float()).item()

    return {
        'loss': float(loss.item()),
        'policy_loss': float(policy_loss.item()),
        'value_loss': float(value_loss.item()),
        'entropy': float(torch.mean(entropy).item()),
        'approx_kl': float(approx_kl),
        'clipfrac': float(clipfrac),
    }

9) Train PPO2 on CartPole-v1#

We’ll keep this as close as possible to the textbook PPO2 recipe:

  • rollout length: n_steps

  • multi-epoch mini-batch SGD updates

  • GAE(\(\lambda\)) advantages (normalized)

  • plot episodic rewards

Tip: CartPole is fast. If you try harder environments, prefer vectorized envs (parallel rollouts) for more stable gradient estimates.

def train_ppo2(
    *,
    env_id: str = 'CartPole-v1',
    total_timesteps: int = 150_000,
    n_steps: int = 2048,
    n_epochs: int = 10,
    minibatch_size: int = 64,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
    learning_rate: float = 3e-4,
    clip_coef: float = 0.2,
    vf_clip_coef: Optional[float] = None,
    ent_coef: float = 0.0,
    vf_coef: float = 0.5,
    max_grad_norm: float = 0.5,
    target_kl: Optional[float] = 0.03,
    seed: int = 42,
) -> Dict[str, List[float]]:
    set_seed_everywhere(seed)

    env = gym.make(env_id)
    obs0 = env_reset(env, seed=seed)
    obs_dim = int(np.prod(env.observation_space.shape))

    model = ActorCritic(obs_dim=obs_dim, action_space=env.action_space).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, eps=1e-5)

    logs: Dict[str, List[float]] = {
        'timesteps': [],
        'episode_returns': [],
        'policy_loss': [],
        'value_loss': [],
        'entropy': [],
        'approx_kl': [],
        'clipfrac': [],
        'explained_variance': [],
    }

    obs = obs0
    ep_return = 0.0

    num_updates = math.ceil(total_timesteps / n_steps)
    global_step = 0

    for update in range(num_updates):
        # Linear schedules (common PPO2 choice)
        frac = 1.0 - (update / num_updates)
        lr_now = learning_rate * frac
        clip_now = clip_coef * frac
        for pg in optimizer.param_groups:
            pg['lr'] = lr_now

        rollout = make_rollout_storage(n_steps=n_steps, obs_dim=obs_dim, action_space=env.action_space)

        # Collect on-policy data
        for t in range(n_steps):
            rollout.obs[t] = np.asarray(obs, dtype=np.float32).reshape(-1)

            obs_t = torch.tensor(rollout.obs[t], dtype=torch.float32, device=device).unsqueeze(0)
            with torch.no_grad():
                action_t, logp_t, _ent_t, value_t = model.act(obs_t)

            if model.is_discrete:
                action = int(action_t.item())
            else:
                action = action_t.squeeze(0).cpu().numpy().astype(np.float32)

            next_obs, reward, done, _info = env_step(env, action)

            rollout.actions[t] = action
            rollout.log_probs[t] = float(logp_t.item())
            rollout.values[t] = float(value_t.item())
            rollout.rewards[t] = float(reward)
            rollout.dones[t] = float(done)

            ep_return += reward
            global_step += 1

            obs = next_obs
            if done:
                logs['episode_returns'].append(float(ep_return))
                ep_return = 0.0
                obs = env_reset(env)

        # Bootstrap value for the last observation
        obs_last = torch.tensor(np.asarray(obs, dtype=np.float32).reshape(-1), device=device).unsqueeze(0)
        with torch.no_grad():
            next_value = float(model.value(obs_last).item())

        adv_np, ret_np = compute_gae(
            rewards=rollout.rewards,
            dones=rollout.dones,
            values=rollout.values,
            next_value=next_value,
            gamma=gamma,
            gae_lambda=gae_lambda,
        )

        # Flatten batch tensors
        b_obs = torch.tensor(rollout.obs, dtype=torch.float32, device=device)
        if model.is_discrete:
            b_actions = torch.tensor(rollout.actions, dtype=torch.int64, device=device)
        else:
            b_actions = torch.tensor(rollout.actions, dtype=torch.float32, device=device)
        b_old_logp = torch.tensor(rollout.log_probs, dtype=torch.float32, device=device)
        b_old_values = torch.tensor(rollout.values, dtype=torch.float32, device=device)
        b_adv = torch.tensor(adv_np, dtype=torch.float32, device=device)
        b_returns = torch.tensor(ret_np, dtype=torch.float32, device=device)

        # Advantage normalization is standard PPO2 practice
        b_adv = (b_adv - b_adv.mean()) / (b_adv.std() + 1e-8)

        # PPO update: multiple epochs over the same on-policy batch
        batch_indices = np.arange(n_steps)

        metrics_accum = {
            'policy_loss': [],
            'value_loss': [],
            'entropy': [],
            'approx_kl': [],
            'clipfrac': [],
        }

        for epoch in range(n_epochs):
            rng.shuffle(batch_indices)

            for start in range(0, n_steps, minibatch_size):
                mb_idx = batch_indices[start : start + minibatch_size]

                out = ppo2_update(
                    model,
                    optimizer,
                    obs=b_obs[mb_idx],
                    actions=b_actions[mb_idx],
                    old_log_probs=b_old_logp[mb_idx],
                    old_values=b_old_values[mb_idx],
                    advantages=b_adv[mb_idx],
                    returns=b_returns[mb_idx],
                    clip_coef=float(clip_now),
                    vf_clip_coef=vf_clip_coef,
                    ent_coef=float(ent_coef),
                    vf_coef=float(vf_coef),
                    max_grad_norm=float(max_grad_norm),
                )

                for k in metrics_accum:
                    metrics_accum[k].append(out[k])

            # Optional early stopping if KL explodes (common safety valve)
            if target_kl is not None and np.mean(metrics_accum['approx_kl']) > 1.5 * target_kl:
                break

        # Logging at update granularity
        logs['timesteps'].append(float(global_step))
        logs['policy_loss'].append(float(np.mean(metrics_accum['policy_loss'])))
        logs['value_loss'].append(float(np.mean(metrics_accum['value_loss'])))
        logs['entropy'].append(float(np.mean(metrics_accum['entropy'])))
        logs['approx_kl'].append(float(np.mean(metrics_accum['approx_kl'])))
        logs['clipfrac'].append(float(np.mean(metrics_accum['clipfrac'])))
        logs['explained_variance'].append(explained_variance(rollout.values, ret_np))

    env.close()
    return logs
# Run training (adjust total_timesteps if you're on CPU and want it faster)
logs = train_ppo2(
    env_id='CartPole-v1',
    total_timesteps=120_000,
    n_steps=1024,
    n_epochs=10,
    minibatch_size=64,
    learning_rate=3e-4,
    ent_coef=0.0,
    vf_clip_coef=0.2,  # SB/OpenAI-style value clipping (set -1 to disable)
)

len(logs['episode_returns']), logs['episode_returns'][:5]
(726, [19.0, 24.0, 34.0, 26.0, 23.0])
# Plot episodic rewards (and a rolling mean)
episode_returns = np.asarray(logs['episode_returns'], dtype=np.float32)
episodes = np.arange(1, len(episode_returns) + 1)

window = 25
if len(episode_returns) >= window:
    rolling = np.convolve(episode_returns, np.ones(window) / window, mode='valid')
    rolling_x = np.arange(window, len(episode_returns) + 1)
else:
    rolling = episode_returns
    rolling_x = episodes

fig = go.Figure()
fig.add_trace(go.Scatter(x=episodes, y=episode_returns, mode='lines', name='Episode return'))
fig.add_trace(go.Scatter(x=rolling_x, y=rolling, mode='lines', name=f'Rolling mean ({window})', line=dict(width=4)))
fig.update_layout(
    title='PPO2 on CartPole-v1: episodic reward over training',
    xaxis_title='Episode',
    yaxis_title='Episodic return',
    height=420,
)
fig.show()
# Plot training diagnostics per update
df = {
    'update': np.arange(len(logs['timesteps'])),
    'timesteps': np.asarray(logs['timesteps']),
    'policy_loss': np.asarray(logs['policy_loss']),
    'value_loss': np.asarray(logs['value_loss']),
    'entropy': np.asarray(logs['entropy']),
    'approx_kl': np.asarray(logs['approx_kl']),
    'clipfrac': np.asarray(logs['clipfrac']),
    'explained_variance': np.asarray(logs['explained_variance']),
}

fig = make_subplots(
    rows=2,
    cols=3,
    subplot_titles=(
        'Policy loss',
        'Value loss',
        'Entropy',
        'Approx KL',
        'Clip fraction',
        'Explained variance',
    ),
)

def add_line(row, col, y, name):
    fig.add_trace(go.Scatter(x=df['update'], y=y, mode='lines', name=name), row=row, col=col)

add_line(1, 1, df['policy_loss'], 'policy_loss')
add_line(1, 2, df['value_loss'], 'value_loss')
add_line(1, 3, df['entropy'], 'entropy')
add_line(2, 1, df['approx_kl'], 'approx_kl')
add_line(2, 2, df['clipfrac'], 'clipfrac')
add_line(2, 3, df['explained_variance'], 'explained_variance')

fig.update_layout(title='Training diagnostics (per PPO update)', height=560, showlegend=False)
fig.update_xaxes(title_text='Update')
fig.show()

10) Stable-Baselines PPO2 (reference implementation)#

Stable-Baselines (the TensorFlow library, now in maintenance mode) provides a PPO2 class.

Example from the Stable-Baselines docs (CartPole with a vectorized env):

import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common import make_vec_env
from stable_baselines import PPO2

env = make_vec_env('CartPole-v1', n_envs=4)
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save('ppo2_cartpole')

We’ll list and explain the Stable-Baselines PPO2 hyperparameters in the next section.

11) Stable-Baselines PPO2 hyperparameters (explained)#

Stable-Baselines PPO2 (TensorFlow) exposes the following constructor signature (from stable_baselines/ppo2/ppo2.py):

PPO2(
    policy,
    env,
    gamma=0.99,
    n_steps=128,
    ent_coef=0.01,
    learning_rate=2.5e-4,
    vf_coef=0.5,
    max_grad_norm=0.5,
    lam=0.95,
    nminibatches=4,
    noptepochs=4,
    cliprange=0.2,
    cliprange_vf=None,
    verbose=0,
    tensorboard_log=None,
    _init_setup_model=True,
    policy_kwargs=None,
    full_tensorboard_log=False,
    seed=None,
    n_cpu_tf_sess=None,
)

What each hyperparameter does#

  • policy: policy class (or registered string) like MlpPolicy, CnnPolicy, MlpLstmPolicy.

  • env: Gym env instance or an env id string (e.g. 'CartPole-v1').

  • gamma: discount factor \(\gamma\).

  • n_steps: rollout horizon per env per update. With vectorized envs, the batch size is:

    \[ n_{\text{batch}} = n_{\text{steps}} \cdot n_{\text{envs}} \]
  • ent_coef: entropy coefficient \(c_e\) (larger → more exploration pressure).

  • learning_rate: learning rate (float) or a schedule function of training progress.

  • vf_coef: value-loss coefficient \(c_v\).

  • max_grad_norm: global gradient norm clip threshold.

  • lam: GAE(\(\lambda\)) parameter.

  • nminibatches: number of minibatches per update (minibatch size is n_batch / nminibatches). For recurrent policies, SB recommends n_envs be a multiple of nminibatches.

  • noptepochs: number of epochs over the on-policy batch per update.

  • cliprange: PPO clip parameter \(\epsilon\) (float) or a schedule.

  • cliprange_vf: value-function clipping range.

    • None (default): reuse cliprange for the value function (OpenAI baselines legacy behavior).

    • negative value (e.g. -1): disable value clipping (closer to the original PPO paper).

    • positive float/schedule: enable value clipping with that range.

    Note: value clipping depends on reward scaling.

  • verbose: logging verbosity.

  • tensorboard_log: TensorBoard log directory (or None).

  • _init_setup_model: whether to build the TF graph at init.

  • policy_kwargs: extra kwargs forwarded to the policy network constructor.

  • full_tensorboard_log: log additional tensors/histograms (large disk usage).

  • seed: random seed (Python/NumPy/TF). For fully deterministic TF runs, SB notes you should set n_cpu_tf_sess=1.

  • n_cpu_tf_sess: number of TensorFlow threads.

Mapping to this notebook#

  • SB n_steps → this notebook’s n_steps

  • SB noptepochs → this notebook’s n_epochs

  • SB nminibatches → this notebook’s minibatch_size = n_steps / nminibatches (single-env case)

  • SB cliprange → this notebook’s clip_coef

  • SB cliprange_vf → this notebook’s vf_clip_coef
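As a sanity check on the batching part of this mapping, here is a tiny helper (hypothetical name, not part of either codebase) that computes the SB-style minibatch size:

```python
def sb_to_notebook_minibatch(n_steps: int, n_envs: int, nminibatches: int) -> int:
    """Stable-Baselines PPO2 minibatch size: n_batch / nminibatches."""
    n_batch = n_steps * n_envs
    assert n_batch % nminibatches == 0, 'nminibatches must divide n_batch'
    return n_batch // nminibatches

# SB defaults (n_steps=128, nminibatches=4) with 8 parallel envs:
sb_to_notebook_minibatch(128, 8, 4)    # → 256
# Single-env analogue of this notebook's settings:
sb_to_notebook_minibatch(1024, 1, 16)  # → 64
```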